
    Bounds for Off-policy Prediction in Reinforcement Learning

    In this paper, we provide, for the first time, error bounds for off-policy prediction in reinforcement learning. The primary objective in off-policy prediction is to estimate the value function of a given target policy of interest, using a linear function approximation architecture, from a sample trajectory generated by a behaviour policy that is possibly different from the target policy. The stability of off-policy prediction was an open question for a long time; only recently did Yu provide a generalized proof, which makes our results more appealing to the reinforcement learning community. Off-policy prediction is useful in complex reinforcement learning settings where an on-policy sample trajectory is hard to obtain and one has to rely on the observed behaviour of the system under an arbitrary policy. We provide here an error bound on the solution of the off-policy prediction problem in terms of a closeness measure between the target and the behaviour policy.
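
    As a purely illustrative aside (assumed names throughout, not the algorithm or the bound from the paper), the kind of update whose stability is at issue here is off-policy TD(0) with linear features and per-step importance-sampling ratios; a minimal sketch:

        import numpy as np

        def off_policy_td0(trajectory, phi, target_policy, behaviour_policy,
                           alpha=0.01, gamma=0.95):
            """Sketch of off-policy TD(0) with linear function approximation.

            `trajectory` is a list of (state, action, reward, next_state)
            tuples generated under the behaviour policy; `phi(s)` returns a
            feature vector; `target_policy(a, s)` and `behaviour_policy(a, s)`
            return action probabilities. All names are illustrative.
            """
            w = np.zeros(len(phi(trajectory[0][0])))
            for (s, a, r, s_next) in trajectory:
                rho = target_policy(a, s) / behaviour_policy(a, s)  # importance ratio
                delta = r + gamma * phi(s_next) @ w - phi(s) @ w    # TD error
                w += alpha * rho * delta * phi(s)                   # off-policy TD update
            return w  # estimated weights: V_target(s) ~ phi(s) @ w

    Without further correction this plain scheme is not guaranteed to be stable, which is precisely why stability and error guarantees of the kind discussed above are of interest.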

    An online prediction algorithm for reinforcement learning with linear function approximation using cross entropy method

    In this paper, we provide two new stable online algorithms for the problem of prediction in reinforcement learning, i.e., estimating the value function of a model-free Markov reward process using a linear function approximation architecture, with memory and computation costs scaling quadratically in the size of the feature set. The algorithms employ a multi-timescale stochastic approximation variant of the popular cross entropy (CE) optimization method, a model-based search method for finding the global optimum of a real-valued function. A proof of convergence of the algorithms using the ODE method is provided. We supplement our theoretical results with experimental comparisons. The algorithms achieve good performance fairly consistently on many RL benchmark problems in terms of computational efficiency, accuracy and stability.
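
    As a rough illustration of the underlying idea only (not the authors' online multi-timescale algorithm), a plain batch cross-entropy search over the weights of a linear value approximator could look like the following sketch; the objective `prediction_error` and the elite fraction are assumptions:

        import numpy as np

        def ce_search(objective, dim, n_samples=100, elite_frac=0.1, n_iters=50):
            """Naive cross-entropy search for a weight vector minimising `objective`.

            A Gaussian sampling distribution over weight vectors is repeatedly
            refitted to the elite (lowest-error) samples; this batch sketch
            omits the online, multi-timescale machinery described above.
            """
            mu, sigma = np.zeros(dim), np.ones(dim)
            n_elite = max(1, int(elite_frac * n_samples))
            for _ in range(n_iters):
                samples = mu + sigma * np.random.randn(n_samples, dim)
                scores = np.array([objective(w) for w in samples])
                elites = samples[np.argsort(scores)[:n_elite]]
                mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-8
            return mu

        def prediction_error(w, transitions, phi, gamma=0.95):
            """Assumed objective: mean squared TD error of the linear estimate
            over observed (state, reward, next_state) transitions."""
            errs = [r + gamma * phi(s2) @ w - phi(s) @ w for (s, r, s2) in transitions]
            return float(np.mean(np.square(errs)))

    A call such as ce_search(lambda w: prediction_error(w, transitions, phi), dim=len(phi(s0))) would then return the weights of the fitted value estimate.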

    Revisiting the Cross Entropy Method with Applications in Stochastic Global Optimization and Reinforcement Learning

    In this paper, we provide a new algorithm for the problem of stochastic global optimization where only noisy versions of the objective function are available. The algorithm is inspired by the well-known cross entropy (CE) method. It takes the shape of a multi-timescale stochastic approximation algorithm in which previous samples are reused via discounted averaging, thereby saving overall computational and storage costs. We provide proofs of the stability and the global optimization property of our algorithm. The algorithm shows good performance on noisy versions of global optimization benchmarks and outperforms a state-of-the-art algorithm for non-linear function approximation in reinforcement learning.
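
    One simple way to picture the sample-reuse idea (a sketch under assumed names, not the paper's multi-timescale algorithm) is a CE-style loop in which the Gaussian sampling parameters are updated by discounted averaging of the elite statistics rather than refitted from scratch:

        import numpy as np

        def noisy_ce(objective, dim, n_samples=50, elite_frac=0.1, beta=0.8, n_iters=200):
            """CE-style minimisation of a noisy objective.

            The elite mean and variance are blended into the running parameters
            with a discount factor `beta`, so information from earlier (noisy)
            evaluations is retained instead of being discarded each iteration.
            """
            mu, var = np.zeros(dim), np.ones(dim)
            n_elite = max(1, int(elite_frac * n_samples))
            for _ in range(n_iters):
                x = mu + np.sqrt(var) * np.random.randn(n_samples, dim)
                y = np.array([objective(xi) for xi in x])           # noisy evaluations
                elite = x[np.argsort(y)[:n_elite]]                  # best candidates
                mu = beta * mu + (1 - beta) * elite.mean(axis=0)    # discounted averaging
                var = beta * var + (1 - beta) * elite.var(axis=0) + 1e-8
            return mu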

    An incremental off-policy search in a model-free Markov decision process using a single sample path

    In this paper, we consider a modified version of the control problem in a model-free Markov decision process (MDP) setting with large state and action spaces. The control problem most commonly addressed in the contemporary literature is to find an optimal policy which maximizes the value function, i.e., the long-run discounted reward of the MDP. Such settings also assume access to a generative model of the MDP, with the hidden premise that observations of the system behaviour in the form of sample trajectories can be obtained with ease from the model. Here, we consider a modified version where the cost function is the expectation of a non-convex function of the value function and no generative model is available; rather, we assume that a single sample trajectory generated using an a priori chosen behaviour policy is made available. In this restricted setting, we solve the modified control problem in its true sense, i.e., we find the best possible policy given this limited information. We propose a stochastic approximation algorithm based on the well-known cross entropy method which is data (sample trajectory) efficient, stable, robust, and efficient in computation and storage. We provide a proof of convergence of our algorithm to a policy which is globally optimal relative to the behaviour policy. We also present experimental results to corroborate our claims, demonstrating the superiority of the solution produced by our algorithm over state-of-the-art algorithms under an appropriately chosen behaviour policy.
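
    To give a flavour of how candidate policies can be scored from a single behaviour trajectory (an assumed, simplified scoring rule, not the algorithm proposed above), one can weight the observed rewards by truncated importance-sampling ratios of a parameterised softmax policy; a CE search, as sketched earlier, would then rank candidate parameter vectors by this score and refit its sampling distribution to the elites:

        import numpy as np

        def softmax_prob(theta, phi_sa, s, a, actions):
            """Probability of action `a` in state `s` under a softmax policy with
            parameters `theta` and state-action features `phi_sa` (assumed names)."""
            prefs = np.array([phi_sa(s, b) @ theta for b in actions])
            prefs -= prefs.max()                       # numerical stability
            p = np.exp(prefs) / np.exp(prefs).sum()
            return p[actions.index(a)]

        def is_score(theta, trajectory, behaviour_prob, phi_sa, actions,
                     gamma=0.95, w_max=10.0):
            """Truncated importance-sampled discounted return of the candidate
            policy, estimated from one (state, action, reward) trajectory
            generated by the behaviour policy."""
            g, w, disc = 0.0, 1.0, 1.0
            for (s, a, r) in trajectory:
                w *= softmax_prob(theta, phi_sa, s, a, actions) / behaviour_prob(s, a)
                w = min(w, w_max)                      # crude variance control
                g += disc * w * r
                disc *= gamma
            return g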

    A Model based Search Method for Prediction in Model-free Markov Decision Process

    In this paper, we provide a new algorithm for the problem of prediction in the model-free MDP setting, i.e., estimating the value function of a given policy using a linear function approximation architecture, with memory and computation costs scaling quadratically in the size of the feature set. The algorithm is a multi-timescale variant of the popular cross entropy (CE) method, a model-based search method for finding the global optimum of a real-valued function. This is the first time a model-based search method has been used for the prediction problem. A proof of convergence using the ODE method is provided. The theoretical results are supplemented with experimental comparisons. The algorithm achieves good performance fairly consistently on many benchmark problems.
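
    For reference, and writing the standard objective of this setting rather than a formula taken from the paper, the prediction problem searched over by such a method can be stated as

        \[
            w^{*} \in \arg\min_{w \in \mathbb{R}^{d}}
            \sum_{s} \nu(s)\,\bigl(V^{\pi}(s) - \phi(s)^{\top} w\bigr)^{2},
        \]

    where \(\phi(s) \in \mathbb{R}^{d}\) is the feature vector of state \(s\), \(V^{\pi}\) is the value function of the given policy \(\pi\), and \(\nu\) is a weighting over states (for example, the stationary distribution of the underlying Markov chain); a model-based search such as CE treats this objective as a black-box function of \(w\) and can only use sample-based estimates of it.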